Results 1 - 6 of 6
1.
JAMIA Open ; 5(4): ooac083, 2022 Dec.
Article in English | MEDLINE | ID: covidwho-2062926

ABSTRACT

Background: One of the increasingly accepted methods to evaluate the privacy of synthetic data is by measuring the risk of membership disclosure. This is the F1 accuracy with which an adversary would correctly ascertain that a target individual from the same population as the real data is in the dataset used to train the generative model. It is commonly estimated using a data partitioning methodology with a 0.5 partitioning parameter. Objective: Validate the membership disclosure F1 score, evaluate and improve the parametrization of the partitioning method, and provide a benchmark for its interpretation. Materials and methods: We performed a simulated membership disclosure attack on 4 population datasets: an Ontario COVID-19 dataset, a state hospital discharge dataset, a national health survey, and an international COVID-19 behavioral survey. Two generative methods were evaluated: sequential synthesis and a generative adversarial network. A theoretical analysis and a simulation were used to determine the correct partitioning parameter that would give the same F1 score as a ground truth simulated membership disclosure attack. Results: The default 0.5 parameter can give quite inaccurate membership disclosure values. The proportion of records from the training dataset in the attack dataset must be equal to the sampling fraction of the real dataset from the population. The approach is demonstrated on 7 clinical trial datasets. Conclusions: Our proposed parameterization, together with the interpretation and generative model training guidance, provides a theoretically and empirically grounded basis for evaluating and managing membership disclosure risk for synthetic data.
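
The partitioning idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the exact-match rule used to decide a membership claim, and the toy records are all assumptions made for illustration. The only element taken from the abstract is that the attack dataset mixes training records and non-training records in a chosen proportion (here `member_fraction`), and that the attack is scored with F1.

```python
import random

def f1_score(is_member, claimed_member):
    """F1 of the adversary's membership claims against the true labels."""
    pairs = list(zip(is_member, claimed_member))
    tp = sum(1 for m, c in pairs if m and c)
    fp = sum(1 for m, c in pairs if not m and c)
    fn = sum(1 for m, c in pairs if m and not c)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def build_attack_set(training, holdout, member_fraction, size, seed=0):
    """Mix training records (members) and holdout records (non-members).

    member_fraction is the partitioning parameter: the share of the
    attack dataset drawn from the training data.
    """
    rng = random.Random(seed)
    n_members = round(member_fraction * size)
    records = rng.sample(training, n_members) + rng.sample(holdout, size - n_members)
    labels = [True] * n_members + [False] * (size - n_members)
    return records, labels

def membership_f1(training, holdout, synthetic, member_fraction, size, seed=0):
    """Score a simulated attack.

    Assumption: the adversary claims membership when a target record
    appears verbatim in the synthetic data. A real attack would more
    likely use a distance threshold; exact matching keeps the sketch short.
    """
    records, labels = build_attack_set(training, holdout, member_fraction, size, seed)
    synthetic_set = set(synthetic)
    claims = [record in synthetic_set for record in records]
    return f1_score(labels, claims)
```

In this sketch, setting `member_fraction` to the sampling fraction of the real dataset from the population, rather than to a fixed 0.5, corresponds to the parameterization the abstract recommends. Records must be hashable (for example, tuples) for the exact-match rule to work.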

2.
PLoS One ; 17(6): e0269097, 2022.
Article in English | MEDLINE | ID: covidwho-1963000

ABSTRACT

BACKGROUND: One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. To demonstrate that the risk of re-identification is acceptably low, re-identification risk metrics are used. There is a dearth of good risk estimators modeling the attack scenario where an adversary selects a record from the microdata sample and attempts to match it with individuals in the population. OBJECTIVES: Develop an accurate risk estimator for the sample-to-population attack. METHODS: A type of estimator based on creating a synthetic variant of a population dataset was developed to estimate the re-identification risk for an adversary performing a sample-to-population attack. The accuracy of the estimator was evaluated through a simulation on four different datasets in terms of estimation error. Two estimators were considered, a Gaussian copula and a d-vine copula. They were compared against three other estimators proposed in the literature. RESULTS: Taking the average of the two copula estimates consistently had a median error below 0.05 across all sampling fractions and true risk values. This was significantly more accurate than existing methods. A sensitivity analysis of the estimator accuracy based on variation in input parameter accuracy provides further application guidance. The estimator was then used to assess re-identification risk and de-identify a large Ontario COVID-19 behavioral survey dataset. CONCLUSIONS: The average of two copula estimators consistently provides the most accurate re-identification risk estimate and can serve as a good basis for managing privacy risks when data are de-identified and shared.


Subject(s)
COVID-19 , COVID-19/epidemiology , Humans , Information Dissemination , Privacy , Probability , Risk
3.
BMJ Open ; 12(5): e050450, 2022 05 18.
Article in English | MEDLINE | ID: covidwho-1950128

ABSTRACT

OBJECTIVE: To examine sex and gender roles in COVID-19 test positivity and hospitalisation in sex-stratified predictive models using machine learning. DESIGN: Cross-sectional study. SETTING: UK Biobank prospective cohort. PARTICIPANTS: Participants tested between 16 March 2020 and 18 May 2020 were analysed. MAIN OUTCOME MEASURES: The endpoints of the study were COVID-19 test positivity and hospitalisation. Forty-two variables covering individuals' demographics, psychosocial factors and comorbidities were used as likely determinants of outcomes. A gradient boosting machine was used for building prediction models. RESULTS: Of 4510 individuals tested (51.2% female, mean age=68.5±8.9 years), 29.4% tested positive. Males were more likely to be positive than females (31.6% vs 27.3%, p=0.001). In females, living in more deprived areas, lower income, increased low-density lipoprotein (LDL) to high-density lipoprotein (HDL) ratio, working night shifts and living with a greater number of family members were associated with a higher likelihood of a positive COVID-19 test. In males, greater body mass index and LDL to HDL ratio were the factors associated with a positive test. Older age and adverse cardiometabolic characteristics were the most prominent variables associated with hospitalisation of test-positive patients in both overall and sex-stratified models. CONCLUSION: High-risk jobs, crowded living arrangements and living in deprived areas were associated with increased COVID-19 infection in females, while high-risk cardiometabolic characteristics were more influential in males. Gender-related factors have a greater impact on females; hence, they should be considered in identifying priority groups for COVID-19 infection vaccination campaigns.


Subject(s)
COVID-19 , Cardiovascular Diseases , Aged , Biological Specimen Banks , COVID-19/epidemiology , Cross-Sectional Studies , Female , Hospitalization , Humans , Machine Learning , Male , Middle Aged , Prospective Studies , United Kingdom/epidemiology
4.
Int J Health Geogr ; 21(1): 1, 2022 01 19.
Article in English | MEDLINE | ID: covidwho-1633795

ABSTRACT

This article provides a state-of-the-art summary of location privacy issues and geoprivacy-preserving methods in public health interventions and health research involving disaggregate geographic data about individuals. Synthetic data generation (from real data using machine learning) is discussed in detail as a promising privacy-preserving approach. To fully achieve their goals, privacy-preserving methods should form part of a wider comprehensive socio-technical framework for the appropriate disclosure, use and dissemination of data containing personal identifiable information. Select highlights are also presented from a related December 2021 AAG (American Association of Geographers) webinar that explored ethical and other issues surrounding the use of geospatial data to address public health issues during challenging crises, such as the COVID-19 pandemic.


Subject(s)
COVID-19 , Privacy , Confidentiality , Humans , Pandemics , Public Health , SARS-CoV-2 , Social Justice
5.
JAMIA Open ; 4(1): ooab012, 2021 Jan.
Article in English | MEDLINE | ID: covidwho-1123295

ABSTRACT

BACKGROUND: Concerns about patient privacy have limited access to COVID-19 datasets. Data synthesis is one approach for making such data broadly available to the research community in a privacy protective manner. OBJECTIVES: Evaluate the utility of synthetic data by comparing analysis results between real and synthetic data. METHODS: A gradient boosted classification tree was built to predict death using Ontario's 90 514 COVID-19 case records linked with community comorbidity, demographic, and socioeconomic characteristics. Model accuracy and relationships were evaluated, as well as privacy risks. The same model was developed on a synthesized dataset and compared to one from the original data. RESULTS: The AUROC and AUPRC for the real data model were 0.945 [95% confidence interval (CI), 0.941-0.948] and 0.34 (95% CI, 0.313-0.368), respectively. The synthetic data model had AUROC and AUPRC of 0.94 (95% CI, 0.936-0.944) and 0.313 (95% CI, 0.286-0.342) with confidence interval overlap of 45.05% and 52.02% when compared with the real data. The most important predictors of death for the real and synthetic models were in descending order: age, days since January 1, 2020, type of exposure, and gender. The functional relationships were similar between the two data sets. Attribute disclosure risks were 0.0585, and membership disclosure risk was low. CONCLUSIONS: This synthetic dataset could be used as a proxy for the real dataset.
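
The confidence interval overlap reported in this abstract can be illustrated with a short Python sketch. The abstract does not state which formula was used, so the definition below is an assumption: it is one commonly used form, in which the shared length of the two intervals is expressed as a fraction of each interval's width and the two fractions are averaged. The function name `ci_overlap` is hypothetical.

```python
def ci_overlap(real_ci, synth_ci):
    """Average proportional overlap of two confidence intervals.

    Assumed definition (not confirmed by the abstract): the shared
    length divided by each interval's width, averaged over the two
    intervals. The value is 1.0 for identical intervals and becomes
    negative when the intervals do not overlap at all.
    """
    (low_r, high_r), (low_s, high_s) = real_ci, synth_ci
    shared = min(high_r, high_s) - max(low_r, low_s)
    return 0.5 * (shared / (high_r - low_r) + shared / (high_s - low_s))

# Rounded bounds as printed in the abstract (real model, synthetic model).
auroc_overlap = ci_overlap((0.941, 0.948), (0.936, 0.944))
auprc_overlap = ci_overlap((0.313, 0.368), (0.286, 0.342))
```

With the rounded bounds printed in the abstract, this definition gives roughly 0.40 for AUROC and 0.52 for AUPRC. These need not match the reported 45.05% and 52.02% exactly, since the abstract gives bounds to three decimal places and the underlying formula is assumed here.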

6.
J Med Internet Res ; 22(11): e23139, 2020 11 16.
Article in English | MEDLINE | ID: covidwho-945539

ABSTRACT

BACKGROUND: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them. OBJECTIVE: The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data. METHODS: A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this "meaningful identity disclosure risk." The model is applied on samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data. RESULTS: The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively. CONCLUSIONS: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.


Subject(s)
COVID-19/epidemiology , Disclosure/standards , Information Dissemination/methods , Humans , Reproducibility of Results , Risk Factors , SARS-CoV-2